{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Benchmarking Python Code Generation with Vanilla, ONNX Converted and Quantized CodeGen Models\n",
        "This notebook is a companion of chapter 6 of the \"Domain Specific LLMs in Action\" book, author Guglielmo Iozzia, [Manning Publications](https://www.manning.com/), 2024.  \n",
        "The code in this notebook is to benchmark inference performance (latency and throughtput) when generating Python code using a Vanilla [CodeGen](https://huggingface.co/Salesforce/codegen-350M-mono) 350M mono model, after ONNX conversion of the same model and after 8-bit quantization. It doesn't require hardware acceleration.  \n",
        "More details about the code can be found in the related book's chapter."
      ],
      "metadata": {
        "id": "ro2mCsr9wNoZ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Install the missing requirements in the ColabVM (only Optimum for the ONNX runtime)."
      ],
      "metadata": {
        "id": "uxm-Oc-FxRGx"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install optimum[onnxruntime]==1.21.2"
      ],
      "metadata": {
        "id": "qBniEqbYvTsI"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Update the Transformers library to the latest version. A runtime restart is needed after."
      ],
      "metadata": {
        "id": "Lh5mdYf_V4Kf"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install -U transformers"
      ],
      "metadata": {
        "id": "Nh1Jwr5XQI4Z"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Vanilla Model"
      ],
      "metadata": {
        "id": "8Qgo8j3yAEC_"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Download the CodeGen 350 M mono model and its tokenizer from the HF's Hub."
      ],
      "metadata": {
        "id": "o3uieL3kxc-C"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import AutoTokenizer\n",
        "\n",
        "device = \"cpu\"\n",
        "model_id = \"Salesforce/codegen-350M-mono\"\n",
        "tokenizer = AutoTokenizer.from_pretrained(model_id)"
      ],
      "metadata": {
        "id": "oQXJ5wJJuMUd"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import CodeGenForCausalLM\n",
        "\n",
        "model = CodeGenForCausalLM.from_pretrained(model_id).to(device)\n",
        "model.eval()"
      ],
      "metadata": {
        "id": "9JpoSzoHaOR4"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Set a text prompt (a Python function header) to be used across benchmarks."
      ],
      "metadata": {
        "id": "NlUHkMBOxtGL"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "prompt = \"def hello_world():\""
      ],
      "metadata": {
        "id": "_Euq2A50ANLW"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The code in the following cell is just to verify that model and tokenizer have been downloaded properly. You can skip its execution."
      ],
      "metadata": {
        "id": "2QIHUGmFx1EB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids\n",
        "generated_ids = model.generate(input_ids, max_length=12)\n",
        "print(tokenizer.decode(generated_ids[0],\n",
        "                       skip_special_tokens=True,\n",
        "                       pad_token_id=50256))"
      ],
      "metadata": {
        "id": "3_AXjHFvaTO8"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Setup a Transformers' pipeline for inference with the Vanilla model."
      ],
      "metadata": {
        "id": "td1Nl5uUx_6v"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import pipeline\n",
        "\n",
        "pipe = pipeline(\"text-generation\",\n",
        "                model=model,\n",
        "                tokenizer=tokenizer,\n",
        "                pad_token_id=50256,\n",
        "                truncation=True,\n",
        "                max_length=12\n",
        "      )"
      ],
      "metadata": {
        "id": "HTBIqZ9MmF6P"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Test the pipeline."
      ],
      "metadata": {
        "id": "K3x3oWkbyHFw"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "result = pipe(prompt)\n",
        "print(result[0]['generated_text'])"
      ],
      "metadata": {
        "id": "p4NxE8a1m6xE"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "tokenizer.save_pretrained(\"local-pt-checkpoint\")\n",
        "model.save_pretrained(\"local-pt-checkpoint\")"
      ],
      "metadata": {
        "id": "DIIsBPsmudM3"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Define some utils for benchmarking (more details about them in chapter 6 of the book)."
      ],
      "metadata": {
        "id": "q70kfNsKCeYx"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from contextlib import contextmanager\n",
        "from dataclasses import dataclass\n",
        "from time import perf_counter\n",
        "\n",
        "@contextmanager\n",
        "def track_infer_time(time_buffer):\n",
        "    start_time = perf_counter()\n",
        "    yield\n",
        "    end_time = perf_counter()\n",
        "\n",
        "    time_buffer.append(end_time - start_time)\n",
        "\n",
        "@dataclass\n",
        "class BenchmarkInferenceResult:\n",
        "    model_inference_time: [int]\n",
        "    optimized_model_path: str"
      ],
      "metadata": {
        "id": "pL77v2MCEhNx"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Define a custom funtion to be reused across benchmarks with the different versions of the model under evaluation."
      ],
      "metadata": {
        "id": "il5ctRs0yU7Y"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from tqdm import trange\n",
        "\n",
        "def benchmark_inference(providers_dict, pipe, prompt, results):\n",
        "  for device, label in PROVIDERS:\n",
        "      for _ in trange(10, desc=\"Warming up\"):\n",
        "        pipe(prompt)\n",
        "\n",
        "      time_buffer = []\n",
        "      for _ in trange(100, desc=f\"Tracking inference time ({label})\"):\n",
        "        with track_infer_time(time_buffer):\n",
        "          pipe(prompt)\n",
        "\n",
        "      results[label] = BenchmarkInferenceResult(\n",
        "          time_buffer,\n",
        "          None\n",
        "      )\n",
        "\n",
        "  return results"
      ],
      "metadata": {
        "id": "kROG4REwFO2d"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Execute the benchmarks for the CodeGen vanilla model."
      ],
      "metadata": {
        "id": "Y9-RNdkvyfQ1"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "results = {}\n",
        "PROVIDERS = {\n",
        "    (\"cpu\", \"PyTorch CPU\"),\n",
        "}\n",
        "results = benchmark_inference(PROVIDERS, pipe, prompt, results)"
      ],
      "metadata": {
        "id": "dL9m9HtcFsP9"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### ONNX Conversion"
      ],
      "metadata": {
        "id": "AHSowqvkAH2K"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "To prevent potential out of memory issues, let's delete the original model from memory."
      ],
      "metadata": {
        "id": "Ze83zNmcy0jR"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import gc\n",
        "\n",
        "del model\n",
        "gc.collect()"
      ],
      "metadata": {
        "id": "sF_f0cr9HA6p"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Convert the CodeGen 350M mono model using the Optimum package."
      ],
      "metadata": {
        "id": "v4ohyCiRy9uO"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from optimum.onnxruntime import ORTModelForCausalLM\n",
        "\n",
        "model_id = 'Salesforce/codegen-350M-mono'\n",
        "model = ORTModelForCausalLM.from_pretrained(model_id,\n",
        "                                            export=True,\n",
        "                                            provider=\"CPUExecutionProvider\")"
      ],
      "metadata": {
        "id": "FZciN9xVvUzp"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Save the converted model to disk."
      ],
      "metadata": {
        "id": "XPEzFKXFzGRO"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from pathlib import Path\n",
        "\n",
        "onnx_path = Path(\"onnx\")\n",
        "model.save_pretrained(onnx_path)"
      ],
      "metadata": {
        "id": "biYmiE5e3h6Q"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Setup a pipeline for inference with the ONNX converted CodeGen 350M mono model."
      ],
      "metadata": {
        "id": "VJje5ki3zRnu"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "pipe = pipeline(\"text-generation\",\n",
        "                model=model,\n",
        "                tokenizer=tokenizer,\n",
        "                pad_token_id=50256,\n",
        "                truncation=True,\n",
        "                max_length=12\n",
        "                )"
      ],
      "metadata": {
        "id": "APwRbKX8anHK"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Verify that the pipeline works."
      ],
      "metadata": {
        "id": "OWjihgZ7zYEB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "result = pipe(prompt)\n",
        "result"
      ],
      "metadata": {
        "id": "e8wUQlWczdiN"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Repeat the benchmark on the ONNX converted model."
      ],
      "metadata": {
        "id": "aAanMvXczfCd"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "PROVIDERS = {\n",
        "    (\"CPUExecutionProvider\", \"ONNX CPU\"),\n",
        "}\n",
        "results = benchmark_inference(PROVIDERS, pipe, prompt, results)"
      ],
      "metadata": {
        "id": "OnPpHEKoIGxk"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 8-bit Quantization"
      ],
      "metadata": {
        "id": "NKlAIq_oz81N"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "To prevent potential out of memory issues, let's delete the pipeline from memory."
      ],
      "metadata": {
        "id": "KESvr_wOzy7i"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "del pipe\n",
        "gc.collect()"
      ],
      "metadata": {
        "id": "TQUAblExJWvm"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Do dynamic 8-bit quantization of the ONNX converted model and save it to disk."
      ],
      "metadata": {
        "id": "nijBTdmv0V41"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from optimum.onnxruntime import ORTQuantizer\n",
        "from optimum.onnxruntime.configuration import AutoQuantizationConfig\n",
        "\n",
        "dynamic_quantizer = ORTQuantizer.from_pretrained(model)\n",
        "dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False,\n",
        "                                              per_channel=False)\n",
        "\n",
        "model_quantized_path = dynamic_quantizer.quantize(\n",
        "    save_dir=onnx_path,\n",
        "    quantization_config=dqconfig,\n",
        ")"
      ],
      "metadata": {
        "id": "YfM7NeIb2vqR"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Load the quantized model in memory before setting the pipeline for it."
      ],
      "metadata": {
        "id": "yusDTLB00jXL"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "quantized_model = ORTModelForCausalLM.from_pretrained(\"onnx\", file_name=\"model_quantized.onnx\")"
      ],
      "metadata": {
        "id": "3fBJg0ryPOnP"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Setup the pipeline for inference with the quantized model."
      ],
      "metadata": {
        "id": "gUosoY7N0o3C"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "pipe = pipeline(\"text-generation\",\n",
        "                model=quantized_model,\n",
        "                tokenizer=tokenizer,\n",
        "                pad_token_id=50256,\n",
        "                truncation=True,\n",
        "                max_length=12\n",
        "                )"
      ],
      "metadata": {
        "id": "6GQypqF7Qco_"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Verify that the pipeline works as expected."
      ],
      "metadata": {
        "id": "gW0GqH1E17dz"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "result = pipe(prompt)\n",
        "result"
      ],
      "metadata": {
        "id": "ehBfVq_CQq28"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Repeat the benchmark on the quantized model."
      ],
      "metadata": {
        "id": "WGUi3AaYDC-I"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "PROVIDERS = {\n",
        "    (\"CPUExecutionProvider\", \"ONNX Quant CPU\"),\n",
        "}\n",
        "results = benchmark_inference(PROVIDERS, pipe, prompt, results)"
      ],
      "metadata": {
        "id": "l4cTBDxVIu4X"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Results of the Benchmarks"
      ],
      "metadata": {
        "id": "56NkSjpODGv5"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Visually compare the average inference times across benchmarks for the 3 different versions of the model."
      ],
      "metadata": {
        "id": "8ak1DWKV2Y5p"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "import plotly.express as px\n",
        "\n",
        "# Compute average inference time\n",
        "time_results = {k: np.mean(v.model_inference_time) * 1e3 for k, v in results.items()}\n",
        "\n",
        "fig = px.bar(x=time_results.keys(), y=time_results.values(),\n",
        "             title=\"Average inference time (ms) for each provider\",\n",
        "             labels={'x':'Provider', 'y':'Avg Inference time (ms)'},\n",
        "             text_auto='.2s')\n",
        "fig.show()"
      ],
      "metadata": {
        "id": "2QmZNhc4DMhq"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Calculate latency and throughput metrics for the 3 benchmark sets and put them into a Pandas DataFrame."
      ],
      "metadata": {
        "id": "HgE7aBCb2l32"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "time_results = {k: np.mean(v.model_inference_time) * 1e3 for k, v in results.items()}\n",
        "time_results_std = {k: np.std(v.model_inference_time) * 1000 for k, v in results.items()}"
      ],
      "metadata": {
        "id": "fOsjKpmlK_PQ"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "perf_results = {}\n",
        "for k, v in results.items():\n",
        "  latency_list = v.model_inference_time\n",
        "  latency_50 = np.percentile(latency_list, 50) * 1e3\n",
        "  latency_75 = np.percentile(latency_list, 75) * 1e3\n",
        "  latency_90 = np.percentile(latency_list, 90) * 1e3\n",
        "  latency_95 = np.percentile(latency_list, 95) * 1e3\n",
        "  latency_99 = np.percentile(latency_list, 99) * 1e3\n",
        "\n",
        "  average_latency = np.mean(v.model_inference_time) * 1e3\n",
        "  throughput = 1 * (1000 / average_latency)\n",
        "\n",
        "  perf_results[k] = (\n",
        "        average_latency,\n",
        "        latency_50,\n",
        "        latency_75,\n",
        "        latency_90,\n",
        "        latency_95,\n",
        "        latency_99,\n",
        "        throughput,\n",
        "    )"
      ],
      "metadata": {
        "id": "Uao1a7ALLAP6"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import pandas as pd\n",
        "\n",
        "index_labels = ['Average_latency (ms)', 'Latency_P50', 'Latency_P75',\n",
        "                'Latency_P90', 'Latency_P95', 'Latency_P99', 'Throughput']\n",
        "perf_df = pd.DataFrame(data=perf_results, index=index_labels)\n",
        "perf_df"
      ],
      "metadata": {
        "id": "P5AMItfgMQMi"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Visually compare inference durations across benchmarks for the 3 different versions of the model."
      ],
      "metadata": {
        "id": "LOFMzIya283h"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "results_df = pd.DataFrame(columns=['Provider', 'Inference_time'])\n",
        "for k, v in results.items():\n",
        "  for i in range(len(v.model_inference_time)):\n",
        "    results_df.loc[len(results_df.index)] = [k, v.model_inference_time[i] * 1e3]\n",
        "\n",
        "fig = px.box(results_df, x=\"Provider\", y=\"Inference_time\",\n",
        "             points=\"all\",\n",
        "             labels={'Provider':'Provider', 'Inference_time':'Inference durations (ms)'})\n",
        "fig.show()"
      ],
      "metadata": {
        "id": "22IhNSXGPcfC"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}